ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel #19132
base: master
Conversation
cc @Alcpz
Regarding the CI failure: when I ran the same command on my system, it built correctly with no issues. Can we check or rerun the CI pipeline? We have not made any changes to CMake or the x86 code. I am attaching the logs.
Overall I don't see any issues with the existing implementation, so all good from my perspective. Please also try to run clang-format on your changes; there are some inconsistencies in the style.
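For reference, one minimal way to apply clang-format to just the touched lines (assuming clang-format and the git-clang-format helper are installed; the thread does not specify the exact invocation):

```sh
# Reformat only the lines changed relative to master (requires git-clang-format).
git clang-format master
```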
```cpp
constexpr int q8_k_blocklen = 4;
const uint8x16_t m4b = vdupq_n_u8(0x0f);
#if defined(__aarch64__) && defined(__ARM_FEATURE_SVE) && defined(__ARM_FEATURE_MATMUL_INT8)
if (svcntb()*8 == 256) {
```
Format
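Presumably the request is spacing around the operators; my reading of the fix, not the reviewer's exact wording:

```cpp
// svcntb() returns the SVE vector length in bytes; * 8 converts it to bits,
// so this branch selects the 256-bit SVE path (e.g. Graviton3).
if (svcntb() * 8 == 256) {
```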
```cpp
}

// q8_ptr[b].qs has interleaved Q8 rows (01, 23)
// const int8_t * q8_base = q8_ptr[b].qs + sb * 256;
```
There is redundant commented code. Some comments could be improved a bit as well.
```cpp
for (int y = 0; y < nr / q8_k_blocklen; y++) {
    const block_q8_Kx4 * GGML_RESTRICT q8_ptr   = (const block_q8_Kx4 *) vy + (y * nb);
    const block_q8_Kx4 * GGML_RESTRICT q8_ptr_1 = (const block_q8_Kx4 *) vy + (y * nb);
```
I don't understand the need for the same variable twice; I don't see them being used in a way that makes this necessary. Either clarify or clean up.
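A minimal cleanup sketch, assuming q8_ptr_1 really is redundant (both pointers are computed from the identical expression):

```cpp
// Both declarations point at the same block, so a single pointer suffices.
const block_q8_Kx4 * GGML_RESTRICT q8_ptr = (const block_q8_Kx4 *) vy + (y * nb);
```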
```cpp
acc_f32_67 = svdup_n_f32(0);

for (int b = 0; b < nb; b++) {
    // bsums pairs belongs to the same q8_k subblock // 64 elemnts loaded and made sum of 0-7 and 8-15 sum || 16-23 and 24 - 31 sum
```
Suggested change:

```diff
-// bsums pairs belongs to the same q8_k subblock // 64 elemnts loaded and made sum of 0-7 and 8-15 sum || 16-23 and 24 - 31 sum
+// bsums pairs belongs to the same q8_k subblock
+// 64 elements loaded and made sum of 0-7 and 8-15 sum || 16-23 and 24 - 31 sum
```
The server failures are due to changes in the CI; if you rebase on top of master you should get rid of those. I also saw the x86 high-performance job failing on other pipelines, but as you say, that is not caused by this PR.
Force-pushed from c75f491 to 1d4d342.
@Alcpz rebase and format-related changes are pushed. Thank you!
This PR introduces support for SVE (Scalable Vector Extension) kernels for the q4_K_q8_K GEMM using i8mm and vector instructions. ARM NEON support for this kernel was added in PR #16739.
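For intuition, a minimal sketch of the core i8mm building block on SVE; this is not the PR's actual kernel, and the helper name mmla_step and the pointer layout are illustrative only:

```cpp
#include <arm_sve.h>  // compile with e.g. -march=armv8.6-a+sve+i8mm

// svmmla_s32 treats each 128-bit segment of its inputs as a 2x8 int8 tile
// and accumulates the 2x2 int32 matrix product into `acc`; interleaving
// rows in pairs (01, 23) is what makes this tiling possible.
static inline svint32_t mmla_step(svint32_t acc,
                                  const int8_t * a,   // unpacked, interleaved Q4 weights
                                  const int8_t * b) { // interleaved Q8 activations
    const svbool_t pg = svptrue_b8();
    const svint8_t va = svld1_s8(pg, a);
    const svint8_t vb = svld1_s8(pg, b);
    return svmmla_s32(acc, va, vb);
}
```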
Verifying Feature
-----------------
This PR contains the SVE implementation of the GEMM used to compute the Q4_K quantization.
Kernel: ggml_gemm_q4_K_8x8_q8_K()
By running a Q4_K_M quantized model of Llama-3.1-8B, I checked the generation output. I also verified that the perplexity matches between the NEON and SVE implementations; this change does not appear to have any impact on accuracy.
The command used to measure the perplexity is:
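The exact command is not reproduced in this excerpt; a typical llama.cpp perplexity invocation looks like the following (model and dataset paths are placeholders):

```sh
./llama-perplexity -m Llama-3.1-8B-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw
```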
Performance Check
-----------------
This PR improves the prompt eval time (TTFT) of LLM inference by 17-20% compared to NEON (PR #16739). The performance was measured on a Graviton3E @ 64 cores. Performance is improved as follows; the values are tokens per second.
The command used to measure the performance is:
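Again illustrative rather than the authors' exact command; a llama-bench run with the thread count and prompt/generation lengths set explicitly might look like:

```sh
./llama-bench -m Llama-3.1-8B-Q4_K_M.gguf -t 64 -p 512 -n 128
```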
This work is a contribution of @Vithulep and @abhijain1204fujitsu.